Abstract


The semiconductor industry roadmap projects that 
advances in VLSI technology will permit more than 
one billion transistors on a chip by the year 2010. 
The MIT Raw microprocessor is a proposed architec- 
ture that strives to exploit these chip-level resources 
by implementing thousands of tiles, each comprising 
a processing element and a small amount of memory, 
coupled by a static two-dimensional interconnect. A 
compiler partitions fine-grain instruction-level paral- 
lelism across the tiles and statically schedules inter-tile 
communication over the interconnect. Because Raw 
microprocessors fully expose their internal hardware 
structure to the software, they can be viewed as a gi- 
gantic FPGA with coarse-grained tiles, in which soft- 
ware orchestrates communication over static intercon- 
nections. 

One open challenge in Raw architectures is to de- 
termine their optimal grain size and balance. The 
grain size is the area of each tile, and the balance is 
the proportion of area in each tile devoted to mem- 
ory, processing, communication, and I/O. If the total 
chip area is fixed, more area devoted to processing will 
result in higher processing power per node, but will
lead to fewer tiles.

This paper presents an analytical framework with
which designers can reason about the design space of
Raw microprocessors. Based on an architectural model 
and a VLSI cost analysis, the framework computes the 
performance of applications, and uses an optimization 
process to identify designs that will execute these ap- 
plications most cost-effectively. 

Although the optimal machine configurations ob- 
tained vary for different applications, problem sizes 
and budgets, the general trends for various applica- 
tions are similar. Accordingly, for the applications 
studied, assuming a 1 billion logic transistor equivalent
area, we recommend building a Raw chip with


approximately 1000 tiles, 30 words/cycle global I/O, 
20K bytes of local memory per node, 3-4 words/cycle 
local communication bandwidth, and single-issue pro- 
cessors. This configuration will give performance near 
the global optimum for most applications. 


1 Introduction 


Advances in semiconductor technology have made 
possible the integration of multiple functional units, 
large cache memories, reconfigurable logic arrays and 
peripheral functions into single-chip microprocessors. 
Unfortunately, increases in the performance of con- 
temporary microprocessors have come at the cost of 
increasing inefficiencies in silicon area usage. The in- 
efficiencies arise from the complexity of designs that 
use hardware support to exploit more instruction level 
parallelism. 

Maintaining a rapid increase in microprocessor per- 
formance will require a cost efficient utilization of sili- 
con area. The MIT Raw microprocessor is a proposed 
architecture that exposes its internal hardware struc- 
ture to the compiler, so that the compiler can deter- 
mine and orchestrate the best mapping of an appli- 
cation to the hardware. A Raw microprocessor [1] is 
reminiscent of a coarse-grained FPGA and comprises 
a replicated set of tiles coupled together by a set of 
compiler-orchestrated, pipelined switches (Figure 1).
Each tile contains a simple RISC-like processing core 
and an SRAM memory for instructions and data. In- 
struction memory allows the multiplexing of the com- 
pute logic on a cycle by cycle basis. SRAM mem- 
ory distributed across the tiles eliminates the memory 
bandwidth bottleneck, provides low latency to each 
memory module, and prevents off-chip I/O latency
from limiting effective computational throughput. 

The tiles are interconnected by a high-speed 2D 


mesh network, allowing inter-tile communications to 
occur with register-like latencies. The switches them- 
selves contain some amount of SRAM so that the com- 
piler can load into the switch a program that multi- 
plexes the interconnect in a cycle by cycle fashion, just 
as in a virtual-wires based multi-FPGA system [4]. 

A typical Raw system includes a Raw microproces- 
sor coupled with off-chip RDRAM (RamBus DRAM) 
through multiple high bandwidth paths. The two level 
memory hierarchy, namely, a local SRAM memory at- 
tached to each tile inside the Raw chip, and a large 
external RDRAM memory, is necessary to be able to 
solve large problems that exceed the size of the on-chip 
memory. 

Raw architectures achieve the performance of 
FPGA-based custom computing engines by exploit- 
ing fine-grained parallelism and fast static communi- 
cation, and by exposing the low-level hardware details 
to facilitate compiler orchestration. Unlike FPGA sys- 
tems, however, Raw machines support instruction se- 
quencing and are more flexible because the execution 
of a new operation can be accomplished merely by 
pointing to a new instruction. Compilation in Raw 
is faster than in FPGA systems because it binds into 
hardware commonly used compute mechanisms such 
as ALUs and memory paths, thereby eliminating re- 
peated low-level compilations of these macro units. 
Binding of common mechanisms into hardware also 
yields better execution speed, lower area, and better 
power efficiency than FPGA systems. 

The designer of an FPGA device or a Raw micro- 
processor is faced with the challenge of determining 
the best division of VLSI resources among comput- 
ing, memory, and communication. This challenge is 
termed the balance problem. Furthermore the design- 
ers of both an FPGA and a Raw device must address 
the grain size issue — in other words, whether to im- 
plement a few powerful tiles, or whether to use many 
small tiles each with a lower performance. 

This paper presents an analytical framework with 
which designers can reason about the division of re- 
sources in a VLSI chip. Although our analysis in 
this paper is focussed on the Raw microprocessor, the 
analysis generalizes to other architectures. Our ob- 
jective in this paper is to gain more insight into cost- 
performance optimal designs given a fixed amount of 
resources. 

The framework presented in this paper focuses on 
the performance requirements of applications, intro- 
duces an architecture model, a cost model and a per- 
formance model for applications, and defines an op- 
timization process to search for performance optimal 


designs given a cost constraint. 

The architecture model defines an architecture 
based on parameters that include the number of tiles 
P, the processing power of each tile p, the amount of 
memory in each tile m, and the communication band- 
width out of each tile c. The cost model estimates 
the cost in terms of chip area of realizing the given 
architecture with the specified set of parameters. The 
performance model estimates the runtime of each ap- 
plication as a function of the problem size. Perfor- 
mance estimation is based on both (1) a characteriza- 
tion of the application and its algorithms in terms of 
its requirements including processing steps, memory 
and communication volumes, and (2) the architecture 
model. 

Together with a cost constraint defined in terms 
of the cost model, our performance model allows us 
to perform a constrained optimization on the inde- 
pendent architectural variables. We can, for example, 
compute the points or contours in the architectural 
space that correspond to the best performance for a 
given cost, lowest cost for a given level of performance, 
or best efficiency defined by performance/cost. 

The algorithms used in this study have been 
adapted to the Raw system architecture illustrated in 
Figure 1 by first partitioning them into subproblems 
that can fit within the Raw chip. Each subproblem 
is loaded from the external global RDRAM memory 
into the set of local memories in the tiles. Compu- 
tation occurs on the subproblem, and the results are 
stored back into external RDRAM. All the subprob- 
lems are visited (possibly multiple times) in sequence. 
The algorithmic slowdown due to blocking the prob- 
lem in this manner is accurately modeled. Each sub- 
problem is solved in parallel with a blocking algorithm. 
Applications studied in this paper include Jacobi Re- 
laxation, Dense Matrix Multiply, Nbody, FFT, and 
Largest Common Subsequence. 

The specific contributions of this paper include:


• A general framework for reasoning about the design
  space of VLSI-based parallel architectures, including
  models for cost and performance.


• Insights on optimal grain size and balance in Raw
  microprocessors.


The remainder of this paper is organized as follows. 
Section 2 describes the three models developed in this 
paper: the performance model, the cost model and the 
application model and gives a qualitative analysis of 
cost and performance. Section 2.7 formulates the
optimization process based on previous model assumptions,
and Section 3 gives our experimental results.
Section 4 concludes the paper.


Figure 1: Raw system composition. A typical Raw system includes a Raw microprocessor coupled with off-chip
DRAM and stream IO devices. Each Raw tile contains a simple RISC-like processor, an SRAM memory for
instructions and data, and a switch. The tiles are interconnected in a 2D mesh network that is orchestrated by
the compiler. The switches themselves contain some amount of SRAM so that the compiler can load into the
switch a program that multiplexes the interconnect in a cycle by cycle fashion, just as in a virtual-wires based
multi-FPGA system.


2 Framework 


This section presents the analytical framework used 
in analyzing candidate designs in terms of their grain 
size and balance. We first start with a motivation for 
a study of grain size issues. 


2.1 Motivation 


Two key questions in the design of a Raw micro- 
processor involve the grain size of its tiles and their 
balance. The grain size reflects the sizes of various 
components inside the tiles such as memory, process- 
ing, and communication. A very coarse grain design 
would involve multiple-issue superscalars for process- 
ing and large local memories. Very fine grain designs 
would be similar to contemporary FPGAs and include 
a few bits worth of logic and memory within each tile, 
and a few wires connecting the individual tiles. De- 
signs with a moderate grain size would involve very 
simple single-issue processors in each node. 

Grain size and balance play a large part in deter- 
mining the efficiency or performance per unit cost of 


a machine assuming a fixed total budget. If an engi- 
neer builds a small number of very large (coarse grain) 
nodes, a point of diminishing returns is reached where 
node performance increases very slowly (if at all) as 
node size is increased. On the other hand, building 
a large number of very small (fine grain) nodes will 
also result in diminishing returns as the communica- 
tion costs dominate. The highest efficiency occurs at 
an optimal point between the two extremes. Simi- 
larly, as observed by Kung and others [11, 5], there is 
an optimal balance of resources between the proces- 
sor, memory, and communication components within 
a node. 


While there has been much debate on this topic, 
few concrete results have been reported. Machine
balance and grain size continue to be determined more
by convenience and market forces than by engineer- 
ing analysis. Our primary motivation in undertaking 
this study is to provide an analytical framework to en- 
able engineers to obtain insights into the tradeoffs in 
choosing various machine parameters. 


Let us first provide an overview of the framework. 
Table 1 summarizes our notation organized by model 
category. Throughout the paper, execution times are 
measured in machine cycles, information in units of 
machine words, and cost in SRAM bit equivalents 
(Sbe). As discussed in Section 2.4, an Sbe is the area
occupied by one bit of SRAM memory.


ARCHITECTURE MODEL
  $p$            processing power of each tile
  $m$            amount of SRAM in a tile
  $c$            local communication bandwidth
  $l$            single hop interconnect latency of a word
  $o$            software overhead for communication
  $k_d$          average network distance traversed by messages
  $l_g$          DRAM latency for global communication

COST MODEL
  $K_p(p)$       processor cost per tile
  $K_m(m)$       memory cost per tile
  $K_{lg}(l_g)$  global latency cost for entire Raw system
  $K_{bg}(b_g)$  global bandwidth cost for entire Raw system
  $K$            total cost of Raw system

APPLICATION MODEL
  $N'$           subproblem size: part of the original problem requiring one global step
  $R_p$          total amount of computation required per tile to solve the problem
  $R_m$          total amount of local memory required per tile
                 total amount of local memory per tile required to hold a subproblem
                 total amount of buffer per tile for overlapping local communication

Table 1: Overview of model parameters and functions.

2.2 Overview of the Framework 


Let us overview our analytical framework illus- 
trated in Figure 2 by considering a simple machine 
model. In its simplest form, a parallel machine can be 
characterized by the number of tiles or nodes P, the 
processing power of each node p (operations per cy- 
cle), communication bandwidth of each node c (words 
per cycle), and the amount of local memory per node 
$m$ (words).

For a given problem size and partitioning strategy,
an application can be described by its processing,
communication and memory requirements, or $R_p$
(operations to be performed), $R_c$ (words to be
communicated) and $R_m$ (words).

The performance of the application in terms of its
runtime $T$ is derived from the application requirements
and the architectural model. If the processing time is
$T_p = R_p / p$ and the communication time is
$T_c = R_c / c$, then if processing and communication
are fully overlapped, the runtime is given by
$T = \max(T_p, T_c)$.

We use cost models $K_p(p)$, $K_c(c)$, $K_m(m)$ to map
the machine parameters $P, p, c, m$ into costs. In other
words, the processor cost model $K_p(p)$ provides the
area cost of implementing a processor that can perform
$p$ operations per cycle. The total machine cost for a
$P$ processor machine is then $K = P(K_p + K_c + K_m)$.
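
The simple machine model above can be sketched in a few lines of Python. This is only an illustration: the function names and the workload numbers are hypothetical and do not come from the paper.

```python
def simple_runtime(R_p, R_c, p, c):
    """Runtime in cycles when processing and communication fully overlap."""
    T_p = R_p / p   # processing time
    T_c = R_c / c   # communication time
    return max(T_p, T_c)


def simple_cost(P, K_p, K_c, K_m):
    """Total area cost of P identical tiles, given per-tile component costs."""
    return P * (K_p + K_c + K_m)


# Hypothetical workload: 1e6 operations and 2e5 words per tile.
print(simple_runtime(R_p=1e6, R_c=2e5, p=1.0, c=0.5))   # 1000000.0
```

In this example the tile is compute bound: communication would finish in 4e5 cycles, so the runtime is set by the processing time.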

Given an application with a fixed problem size N 
and an area budget B, a constrained optimization 
problem is defined with the objective of finding the 
optimal machine configuration that gives the smallest 
runtime for that budget. In other words, the framework
finds the set of architectural parameters $P, p, c, m$
that yield a minimum value for $T$, given that the cost
$K$ cannot exceed the available budget $B$. Or more
formally,

    find         $P, p, c, m$
    to minimize  $T = \max(T_p, T_c)$,
    subject to   $B \ge K$.


Figure 2: Analytical framework. The key components
of the framework are the models and the optimization
process. Given an application with an associated
problem size and a fixed budget, the constraint
equations are derived for the optimization. The nonlinear
optimization process searches the machine configuration
space that gives the minimal runtime for the
application.

As discussed in more detail later, the optimization 
process is sped up by a set of balance constraints. The 
balance constraints state that for the optimal solution 
the computation time and communication time must 
be equal, and that the physical memory should fit the 
problem. The balance constraints greatly reduce the 
size of the search space, and thus the complexity of 
the optimization procedure. 
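
The following Python sketch illustrates how the balance constraints shrink the search. It is not the optimization procedure of the paper: it assumes a hypothetical Jacobi-like workload and stand-in cost functions (the constants 250e3, 100e3 and 64 are placeholders, not the cost model of Section 2.4). Only P and p are enumerated; c and m follow from the balance constraints.

```python
import math


def balanced_search(N, budget, tile_counts, powers):
    """Return (T, P, p, c, m) with the smallest runtime T within budget, or None."""
    best = None
    for P in tile_counts:
        # Hypothetical per-tile requirements for one sweep over N grid points.
        R_p = 4 * N / P              # operations to perform
        R_c = 4 * math.sqrt(N / P)   # words to communicate
        R_m = N / P                  # words to store
        for p in powers:
            T = R_p / p              # processing time
            c = R_c / T              # balance: communication time equals T
            m = R_m                  # balance: memory just fits the problem
            # Stand-in area costs per tile (placeholders, not from the paper).
            K = P * (250e3 * p ** 2 + 100e3 * c + 64 * m)
            if K <= budget and (best is None or T < best[0]):
                best = (T, P, p, c, m)
    return best


print(balanced_search(1e6, 1e9, [16, 64, 256, 1024, 4096], [1, 2, 4]))
```

Because c and m are derived rather than searched, the loop visits only the (P, p) pairs, which is the reduction in search space that the balance constraints are meant to provide.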

The following sections discuss each of the compo- 
nents of the framework and the optimization process 
in more detail. 


2.3 Architecture model


This section discusses parameters necessary for 
architecture characterization. Although several ap- 
proaches to modeling the performance of a parallel 
computer have been proposed in the literature [2, 3], 
none are completely suited to modeling fine-grain
parallel systems built on a chip. Figure 3 shows our
characterization of a Raw system using the parameters
described below. Our machine characterization differs
from previous ones in the sense that it captures both
local and global communication performance, and
includes software overheads.


Figure 3: A four node illustrative Raw system
characterized by the parameters
$\langle P, p, m, c, l, o, b_g, l_g \rangle$,
where the processing power per node in operations
per cycle is $p$, the amount of SRAM memory per node
is $m$, the local communication bandwidth per node in
words per cycle is $c$, the software overhead for a
communication event in cycles is $o$, the single hop
communication latency is $l$, the global off-chip communication
bandwidth per Raw chip in words per cycle is $b_g$, and the
RDRAM latency expressed in cycles is $l_g$.

We choose as independent parameters the number
of nodes, $P$, the processing power per node in
operations per cycle, $p$, the memory per node in words, $m$,
the local communication bandwidth per node in words
per cycle, $c$, the software overhead for communication
in cycles, $o$, the single hop latency of the network, $l$,
the global off-chip communication bandwidth per chip
in words per cycle, $b_g$, and the RDRAM latency
expressed in cycles, $l_g$.

As an example, sending a local inter-tile message
of length $L$ words first involves spending $o$ cycles in
launching the message. The message header word
travels on average a distance of $k_d$ hops in the
network, using $l$ cycles per hop. Because the bandwidth
out of a node is $c$ words per cycle, each subsequent
message word takes $1/c$ cycles to enter the network. The
receiving tile would also spend $o$ cycles receiving the
message. Thus, the communication time per message is

    $T_c = 2o + k_d l + (L - 1) \frac{1}{c}$    (1)
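
As a quick check of Equation 1, the per-message time can be written as a small Python function. The function name and the example values are hypothetical; they are chosen only to make the three terms easy to follow.

```python
def local_message_time(L, o, k_d, l, c):
    """Cycles to deliver one L-word message between tiles (Equation 1)."""
    launch_and_receive = 2 * o      # software overhead at sender and receiver
    header_latency = k_d * l        # header crosses k_d hops at l cycles each
    body_time = (L - 1) / c         # remaining words enter at c words per cycle
    return launch_and_receive + header_latency + body_time


# Hypothetical values: 9-word message, o = 2, k_d = 4 hops, l = 1, c = 2.
print(local_message_time(L=9, o=2, k_d=4, l=1, c=2))   # 12.0
```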


Writing a block of data to the off-chip RDRAM
memory first involves an overhead $o$ associated with
starting up global communication. The latency of
accessing the DRAM will be the sum of the latency of
traversing the interconnection network in one dimension
($k_d l / 2$) plus $l_g$, the DRAM latency. (We divide
by two to indicate that RDRAM memory messages
do not have to traverse both the X and Y network
dimensions.) The transfer rate of subsequent words will
be the minimum of the local communication bandwidth
and the global communication bandwidth per
tile (since multiple tiles might be writing external
memory). Thus the time for writing a block of size
$L$ to memory is

    $T_g = o + \frac{k_d}{2} l + l_g + (L - 1) \max\left(\frac{1}{c}, \frac{P}{b_g}\right)$    (2)

Communication locality can be captured at the
application level by accounting for it in the average
distance that messages travel ($k_d$). We ignore
contention effects (e.g. resource and network contention)
because we assume that the compiler can statically
orchestrate communication events much as in a
virtual-wires system. We also use a conservative approach in
defining applications' communication requirements.


2.4 Cost model 


We use silicon area as a measure of cost. Silicon 
area reflects the fundamental cost of building a com- 
ponent and is a good basis for comparing alternatives 
as opposed to market price which includes many artifi- 
cial factors. The cost model is based on CMOS micro- 
processors, SRAM and DRAM memories, and a mesh 
interconnection technology. For simplicity we consider 
the off-chip RDRAM memory free. Although our as- 
sumptions may change specific numerical results, the 
methodology for determining balance and grain size 
remains the same. 

We normalize cost to units of SRAM bits, viz. one
bit of SRAM takes one unit of area and therefore one
unit of cost. We express the cost of all other components
in terms of SRAM bit equivalents (Sbe).

We use the notion of relative density to enable the 
normalization of logic, memory and communication 
areas into units of SRAM bit equivalents. Relative 
density captures the area impact of wires and more ir- 
regular structures such as logic areas versus the more 
regular memory arrays. Although an SRAM bit com- 
prises typically 4 to 6 transistors we observe that the 
area it occupies is similar to the area of a logic transis- 
tor in a CPU die because of its regular structure and 
therefore its higher relative density. Thus, the chip 


size expressed in Sbe units is equivalent to the total 
number of transistors for logic areas. 


Component               Relative density
CPU logic transistor    1 Sbe
SRAM bit                1 Sbe
DRAM bit                1/16 Sbe


Table 2: Relative densities of constituent VLSI components.


A DRAM bit is realized with one transistor and 
the area it occupies is 10-16 times smaller than an 
SRAM bit area. We arrived at this conclusion as the 
typical SRAM cell requires a wire grid of dimension 
3 x 4 compared to a DRAM cell implemented on the 
intersection of two wires. Factors such as the number 
of metallic layers may change the relative density rela- 
tions as more layers increase the density of logic areas. 
The logic area density is also reduced because of the 
greater amount of area devoted to wiring. 

The following cost functions are based on empirical
observations and statistics gathered on current
implementations of superscalars and router chips.

Processor cost $K_p$. The processor cost model
computes the area cost as a function of $p$. We find it
convenient to relate $p$ to cost $K_p$ using an
intermediate parameter $i$, which is the number of issue
units in the processor. Thus, $i = 4$ implies a 4-way
superscalar with a maximum of 4 operations per cycle.

We model the relationship between processing cost
and instruction issue structure as a quadratic curve,
which captures the cost increase due to multiple issue
superscalars.

    $K_p(i) = B_p + K_{ps} (i - 1)^2$    (3)

In the above, a cost of $B_p$ is required to achieve a
single issue processor with $i = 1$.

We relate processing power $p$ and the number of
issue units $i$ using:

    $p = \sqrt{i}$    (4)


This model captures the relationship between per- 
formance and cost due to more aggressive clock rates 
of lower issue processors. Typically single issue designs 
obtain 1.6-2 times faster clock rates than correspond- 
ing high-issue rate processors. It also captures the 
fact that it is easier to obtain performance close to the 
theoretical maximum cycles per instruction in lower- 
issue processors as they require a smaller amount of 
instruction-level parallelism in applications. 


Studying the layout of some simple RISC processors
[6, 14, 13, 8] leads to a base cost of $B_p = 2.5 \times 10^5$
transistors. That is, a minimal single issue 64 bit
processor can be built in the area of 250K SRAM bits
or with 250K logic transistors. A cost constant of
$K_{ps}$ = 4 x 10° Sbe was arrived at from the study of
some high-end processors [22, 20, 21, 8].
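
To make the processor trade-off concrete, the sketch below evaluates the quadratic cost model together with the square-root relation between processing power and issue width. The base cost of 2.5e5 Sbe is taken from the text; the value used for the cost constant (4e5) is an assumption, since its exponent is not legible in the source, and the function names are hypothetical.

```python
import math

B_P = 2.5e5   # base cost of a single-issue processor in Sbe (from the text)
K_PS = 4e5    # ASSUMED magnitude of the cost constant; exponent not legible


def processor_cost(i):
    """Area cost in Sbe of an i-issue processor (quadratic model)."""
    return B_P + K_PS * (i - 1) ** 2


def processing_power(i):
    """Operations per cycle delivered by an i-issue processor."""
    return math.sqrt(i)


for i in (1, 2, 4, 8):
    efficiency = processing_power(i) / processor_cost(i)
    print(i, processor_cost(i), round(efficiency * 1e6, 3))
```

Under the assumed constant, performance per unit area falls as the issue width grows, which is the effect the model is intended to capture.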

For validation, Figure 4 compares the number of 
transistors dedicated to logic in several superscalar 
microprocessors with our cost model for $K_p(i)$. We
observe that for higher-issue superscalars the varia- 
tion in the number of transistors dedicated to logic 
areas is large. This variation is caused by important 
differences in implementation of components like issue 
structure, scheduling and memory interfaces. A more 
detailed cost model for superscalars may also deal with 
the cost impact of dynamic or static issue structures, 
scheduling and memory interfacing. 

Memory cost. We approximate memory cost as a
linear function of capacity $m$.

    $K_m(m) = B_m + W m$    (5)

Here, $m$ is the memory size in words, $W$ is the cost
per word of memory, and $B_m$ is the fixed overhead
cost of the memory. This overhead includes logic for
translation, address decode, data multiplexing, and
memory peripheral circuitry. For our calculations, we
assume that $W$ (word size) = 64, and the overhead,
$B_m$, is $5 \times 10^4$.


Communication cost. The main components of a typical
router are a routing module, a crossbar arbiter,
and input/output modules, often including large
FIFOs. We observe that most of the area in current
router chips is taken up by FIFOs and pad frames
(ca. 20%). The amount of FIFO storage depends on such
factors as the number of virtual channels. The area of
the queues reflects the size of message flits and a
length which is typically 16-20 flits. A flit is the
number of bits transferred in one cycle, and therefore
it also equals $c$ expressed in bits. One word per cycle
of communication bandwidth thus requires a flit size of
one word. Although not necessary, we also assume the
flit size is equal to the physical channel width. We
denote the dimension of the network as $n$. The total
number of bidirectional channels is then $2n$. Our
results focus on two-dimensional networks, so $n = 2$
for most of this paper. Logic areas such as the crossbar
usually occupy a small part of the total area. The cost
function for the routers is described in the following
equation:

    $K_c(c) = B_c + K_{cs} W \times FIFOl \times 2n \times SetofQ \times c$    (6)

Figure 4: Comparison between the processor cost
function $K_p(i)$ and cost of logic areas in current
superscalar microprocessors. Cache memory areas are
factored out.


In the equation above, $FIFOl$ is the length of the
FIFOs and $SetofQ$ is the number of queue sets due
to virtual channels. Our results use $SetofQ = 1$. The
communication cost factor, $K_{cs}$, is derived by fitting
the cost function equation with the areas of routing
chips shown in Table 3.

For our calculations, we use $K_{cs} = 25$. For example,
a router with a 64 bit flit size and with one set of
queues, each with length 16 flits, takes approximately
the area of 125000 logic transistors in our model.

The base area for a router, $B_c$, is estimated at
$2.5 \times 10^4$. We arrive at this from a study of
simple routers [10, 9, 6, 15]. Examples of routers with
the number of transistors used in current
implementations are shown in Table 3. The estimates using
comparison indicates that our cost model reflects rel- 
atively accurately the area occupied by these routers 
except the RDT [7] router chip that has more than half 
of its area devoted to a multicast mechanism module 
and a bit-map generator. 
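
The router example quoted above can be reproduced with a short Python sketch of Equation 6. The parameter names are hypothetical; the defaults are the values stated in the text (64 bit words, 16-flit FIFOs, a two-dimensional network, one set of queues, a cost factor of 25 and a base area of 2.5e4).

```python
def router_cost(c, W=64, fifo_len=16, n=2, sets_of_queues=1,
                B_c=2.5e4, K_cs=25):
    """Router area in Sbe for c words per cycle of local bandwidth (Equation 6)."""
    fifo_area = K_cs * W * fifo_len * 2 * n * sets_of_queues * c
    return B_c + fifo_area


print(router_cost(c=1))   # 127400.0
```

The result, about 1.27e5 Sbe, is close to the figure of roughly 125000 logic transistors given in the text.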


Global communication cost. We approximate 
global communication cost as a linear function of 
global off-chip communication capacity. The base area 
for global I/O, $B_{bg} = 10^4$, is estimated to be
somewhat smaller than a simple router area as no routing
functions are necessary. The global communication
bandwidth is limited by the maximum number of pins
a packaging technology will allow. As current
microprocessor packaging technologies use from one
hundred to several hundred pins, we assume that in 10-12
years packaging will allow no more than roughly
2000 pins. The maximum possible global bandwidth
is then $b_{max} = 2000/W$. The global communication
cost factor, $K_{bsg}$ = 10°, multiplied with the word size
is approximately the cost in SRAM bit equivalents of
one word per cycle of global I/O bandwidth.

    $K_{bg}(b_g) = \begin{cases} \infty & \text{if } b_g > b_{max} \\ B_{bg} + K_{bsg} W b_g & \text{otherwise} \end{cases}$    (7)


Table 3: Important cost factors for router chips. In the Type column we give the number of virtual channels
where necessary, e.g. 2u means 2 virtual channels. The second and third columns compare the actual number and
the estimated number of transistors. With Flits we show the flit size or the number of bits transferred in one
cycle. FIFOl shows the length of FIFOs in flits and SetofQ shows the set of queues in the design, often reflecting
the number of virtual channels.


Global latency cost. For simplicity we assume this
cost to be constant, reflecting the more or less constant
speed of DRAM access over time. $B_{lg}$ is estimated at
10°.

    $K_{lg}(l_g) = B_{lg}$    (8)


Total cost of the system. The total cost of the
system is equal to the sum of its components.

    $K = P (K_p(p) + K_c(c) + K_m(m)) + K_{bg}(b_g) + K_{lg}(l_g)$    (9)
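
The component costs can be combined into a rough calculator for the total area. This is a sketch under stated assumptions, not the authors' implementation: several constants have illegible exponents in the source, so the values used for the processor cost factor (4e5), the global I/O base area (1e4), the global bandwidth cost factor (1e5) and the global latency cost (1e5) are assumptions, and a bandwidth above the pin limit is treated as infeasible.

```python
import math

W = 64  # word size in bits


def tile_cost(i, c, m):
    """Per-tile area in Sbe for issue width i, bandwidth c, memory m words."""
    k_p = 2.5e5 + 4e5 * (i - 1) ** 2       # processor; the 4e5 factor is ASSUMED
    k_c = 2.5e4 + 25 * W * 16 * 2 * 2 * c  # router: 16-flit FIFOs, 2-D mesh
    k_m = 5e4 + W * m                      # memory
    return k_p + k_c + k_m


def total_cost(P, i, c, m, b_g, B_bg=1e4, K_bsg=1e5, B_lg=1e5, pins=2000):
    """Total chip area in Sbe; B_bg, K_bsg and B_lg defaults are ASSUMED."""
    if b_g > pins / W:                     # more I/O than the package allows
        return math.inf
    return P * tile_cost(i, c, m) + B_bg + K_bsg * W * b_g + B_lg


# Configuration quoted in the abstract: 1000 single-issue tiles, c = 3,
# 20K bytes (2560 words) of memory per tile, 30 words/cycle global I/O.
print(total_cost(P=1000, i=1, c=3, m=2560, b_g=30))   # 988150000.0
```

Under these assumed constants, the configuration recommended in the abstract comes to just under 1e9 Sbe, which is at least consistent with the one billion transistor budget.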


2.5 Application model 


The application model contains functions and pa- 
rameters that are necessary for application perfor- 
mance characterization. To predict the performance 
of an application with a particular machine configu- 
ration, we assume that the resource demands are uni- 
form over time and that processing, local and global 
communication can be completely overlapped. Some 
algorithms, such as those used in dynamic program- 
ming, also require the estimation of the algorithmic 
imbalance or the idle time due to synchronization over- 
head. Applications with several phases can be han- 
dled by dividing the application into its phases and 
characterizing each phase separately. Our assumption 
that processing, local and global communication are 


overlapped imposes constraints on how the problem 
is partitioned and on the total amount of memory re- 
quired. As we will show later, besides the memory 
needed to hold the problem, local and global commu- 
nication buffers are required in order to be able to 
overlap communication times. 

We will exemplify the concepts of this section by 
analyzing the Jacobi relaxation problem. The require- 
ments of the other applications considered in this pa- 
per are presented in the Appendix. The Jacobi Relax- 
ation problem is an iterative algorithm which, given a 
set of boundary conditions, finds discretized solutions 
to differential equations of the form $\nabla^2 A + B = 0$.
Each step of the algorithm replaces the value at each 
node of a grid with the average of the values of its 
nearest neighbors. 
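
A single relaxation step can be sketched in Python as follows. This is a minimal serial illustration, assuming a rectangular grid stored as a list of lists with fixed boundary values and no source term; it is not the blocked, parallel version analyzed in this paper, and the function name is hypothetical.

```python
def jacobi_step(grid):
    """One sweep: each interior point becomes the average of its 4 neighbours."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]          # boundary values stay fixed
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Three additions and one multiplication per point.
            new[r][c] = 0.25 * (grid[r - 1][c] + grid[r + 1][c]
                                + grid[r][c - 1] + grid[r][c + 1])
    return new


grid = [[1.0, 1.0, 1.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
print(jacobi_step(grid)[1][1])   # 0.25
```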

The original Jacobi problem defined by a grid of size 
$N$ is partitioned into subgrids of size $N'$ as illustrated
in Figure 5. Each subgrid or subproblem is solved by 
storing the subproblem of size N’ in the internal mem- 
ory of a Raw microprocessor and running a blocking 
relaxation algorithm. After a given number of phases, 
the subgrid is stored in external RDRAM, and the 
next subgrid is loaded. Clearly, a given subgrid has to 
be loaded and operated upon multiple times to reflect 
the effect of synchronization with the values computed 
in neighboring subgrids. 

Because values from neighboring subgrids do not
impact the relaxations on a given subgrid stored in the
microprocessor, the number of iterations needed for
convergence increases. We choose $i_s = \sqrt{N'}/2$ as the
number of iterations after which resynchronizations 
must occur between subproblems. Starting with some 
boundary conditions this means propagating border 
values to all points in a subproblem. We chose the 
total number of iterations as being i; = N? giving an 
error reduction factor of ten. 


Let us analyze the requirements of this application.

Figure 6: Graphical illustration of the solution to the optimization problem; example shown is for Jacobi Relaxation
(one iteration), gridsize 4000 x 4000 points, software overhead is zero, and global latency equals 100 cycles.
The two surfaces correspond to the runtime performance function and a combined equation for the constraint
functions. The intersection of the two surfaces determines the points that give balanced machine configurations.
The points $(c_{opt}, P_{opt})$ that correspond to the smallest runtime are the global optimal solutions of the optimization.
Other parameters such as processing power and memory size can be determined from the constraint equations
by substitution.


Figure 5: Jacobi Relaxation. The problem of size N 
is first partitioned into subproblems of size $N'$. Each
subproblem is solved with blocking on P processors. 
Each processor receives bordering data from its four 
neighbors and sends its data along borders to its neigh- 
bors. Subproblems are resynchronized after a number 
of iterations. 


Required processing per node $R_p$. This requirement
reflects the total amount of computation required
per Jacobi node given the algorithmic assumptions
described above. The operations for each point are
three additions and one multiplication.


Pe [l,N] (10) 
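
The per-point operation count can be illustrated with a
minimal sketch of one relaxation sweep. It assumes the
standard five-point stencil (each interior point becomes
the average of its four neighbors), which is consistent
with three additions and one multiplication per point
but is not spelled out in the text; the function name
jacobi_sweep is hypothetical.

```python
def jacobi_sweep(grid):
    """One Jacobi sweep over the interior points of a 2-D grid.

    Assumes a five-point stencil (average of the four neighbours).
    Returns the new grid together with the number of additions and
    multiplications performed.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary values are copied unchanged
    adds = muls = 0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Three additions sum the four neighbours; one
            # multiplication scales the sum by 1/4.
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
            adds += 3
            muls += 1
    return new, adds, muls
```

On a subblock held by one processor, the neighbors of
the outermost points are the bordering values received
from the four adjacent processors.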


Required amount of memory words per node $R_m$.
The required memory comprises the memory required to
solve the subblock of size N' and the memory buffers
needed to overlap local and global communication.


$$R_m = \frac{N'}{P} + 4\sqrt{\frac{N'}{P}} + 2\frac{N'}{P} = 3\frac{N'}{P} + 4\sqrt{\frac{N'}{P}} \qquad (11)$$
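
As a small worked example, Equation (11) can be
evaluated directly. The sketch below is only an
illustration; the function name is hypothetical, and the
reading of the three terms (block data, four border
buffers, and global communication buffers) is our
interpretation of the surrounding text.

```python
import math


def required_memory_words(n_sub, num_procs):
    """Memory words per node following Equation (11).

    n_sub is the subproblem size N' and num_procs is P.
    The labels on the three terms are an interpretation.
    """
    block = n_sub / num_procs              # points held by one node
    border_buffers = 4 * math.sqrt(block)  # one border per neighbour
    global_buffers = 2 * block             # buffers for global transfers
    return block + border_buffers + global_buffers
```

For instance, with N' = 1,000,000 and P = 100 each node
holds 10,000 points, and the requirement evaluates to
3 x 10,000 + 4 x 100 = 30,400 words.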


Required number of words of local communication
per node $R_c$. The required local communication is
the total amount of data sent or received during the
whole execution time. In every iteration each processor
requires the bordering points from its neighbor
processors.


Required local communication events $R_o$.
These events incur a software penalty for initiating
a communication step. $R_o$ reflects the total number
of times a local send or receive is issued.


$$R_o = R_c \times \frac{1}{\sqrt{N'/P}} \qquad (13)$$

Required latency of events $R_l$. Reflects the total
number of times a local send is issued.


$$R_l = R_c \times \frac{1}{2\sqrt{N'/P}} \qquad (14)$$


Required global communication $R_g$. Reflects the
total number of words of global communication per
chip.


(15) 


Required global communication events $R_{og}$.
Reflects the total number of times global sends or
reads are initiated per chip.


= Rog 
ei 


Rig (16) 


2.6 Performance Functions 


The performance functions estimate the running 
time of an application in terms of application require- 
ments and architecture parameters. 


Runtimes $\langle T, T_p, T_c, T_g \rangle$. Let the times for
processing, local communication, and global
communication be $T_p$, $T_c$, and $T_g$ respectively. Under
the assumption that local and global communication
times are overlapped with computation, the parallel
runtime is defined as the maximum of these times.


$$T = \max(T_p, T_c, T_g)$$

Tr = “ + Roo + Rigo 
Re 

Leo = ae + Rikal 
R k 

T, = 24 Ry —*l+ Rigly (17) 
by 2 


As an example, if the number of operations that
must be performed is $R_p$ and the processing power is
p operations per cycle, then the processing time is
simply $R_p/p$. Similarly, if the number of events
incurring the message overhead (o cycles) is $R_o$, then
the time spent on message overhead is $R_o o$.
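
The two examples above, together with the definition of
the parallel runtime as a maximum, can be written as a
minimal sketch. The function names are hypothetical, and
the sketch deliberately covers only the terms explained
in the text rather than the complete performance
functions.

```python
def compute_time(r_p, p):
    """Cycles needed for r_p operations at p operations per cycle."""
    return r_p / p


def overhead_time(r_o, o):
    """Cycles spent on r_o communication events of o cycles each."""
    return r_o * o


def parallel_runtime(t_p, t_c, t_g):
    """Runtime when communication is overlapped with computation."""
    return max(t_p, t_c, t_g)
```

For example, 4,000,000 operations at 1.25 operations
per cycle take 3,200,000 cycles, and 1000 events with a
3 cycle overhead add 3000 cycles of overhead.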


2.7 The optimization problem 


In this section we describe the optimization
procedure in more detail. The procedure is also
illustrated graphically in Figure 6.


The problem solved is the following constrained
nonlinear optimization problem:


Given: a fixed chip area or budget B and a problem
size N


Objective:
$$\min T(N, N', x) \qquad (18)$$


subject to the constraints given below, where x is
a specific machine configuration
$\langle p, P, m, c, o, l, b_g, l_g \rangle$. The solution of this
optimization is the optimal machine configuration
$x_{opt} = \langle p, P, m, c, o, l, b_g, l_g \rangle_{opt}$
and the optimal subproblem size $N'_{opt}$.
Constraints: 

1. Budget B must be greater than or equal to the
total cost. The total cost of the system is computed
as the sum of the costs of its components.


$$B \ge P(K_p + K_c + K_m) + K_{bg} + K_{lg} \qquad (19)$$


It is expedient to use an additional set of balance
constraints, given below, when communication and
computation are overlapped. The balance constraints
focus the search for the optimal solution on balanced
machine configurations. The second and third
constraints state that communication and computation
times should be equal. If they are not equal, we can
take resources from the faster component without
increasing runtime. The fourth constraint states that
the memory should fit the problem. If the memory is
larger than this amount, it can be reduced without
impacting performance. When the local and global
communication times equal the computation time and the
memory fits the problem, the machine configuration is
balanced for the application. In a balanced machine
each resource is utilized to its fullest. The balance
constraints greatly reduce the search space, and thus
the complexity of the optimization procedure.

2. Balanced local communication and computation.


$$T_c = T_p \qquad (20)$$


3. Balanced global communication and computation.


$$T_g = T_p \qquad (21)$$


4. Memory on each processing element must fit the
memory required for a block. Besides the memory
required for the actual computation, $R_m^p$, buffers
for local and global communication, $R_m^l$ and $R_m^g$,
are allocated because communication is overlapped
with computation.


$$m = R_m = R_m^p + R_m^l + R_m^g \qquad (22)$$
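
The structure of the optimization can be illustrated
with a small brute-force sketch: enumerate candidate
configurations, discard those that violate the budget
constraint, and keep the one with the smallest runtime.
This is not the procedure used in the paper, which
relies on the balance constraints to narrow the search;
the cost and runtime functions below (toy_cost,
toy_runtime) are invented placeholders.

```python
from itertools import product


def optimize(candidates, cost, runtime, budget):
    """Return (runtime, config) of the fastest config within budget."""
    best = None
    for config in candidates:
        if cost(config) > budget:      # budget constraint
            continue
        t = runtime(config)
        if best is None or t < best[0]:
            best = (t, config)
    return best


def toy_cost(config):
    # Hypothetical cost model: P tiles, each costing 10 units per unit
    # of processing power plus 5 units of fixed overhead.
    num_tiles, power = config
    return num_tiles * (10 * power + 5)


def toy_runtime(config):
    # Hypothetical runtime: the maximum of a processing term and a
    # communication term, mirroring T = max(T_p, T_c, T_g).
    num_tiles, power = config
    t_proc = 1_000_000 / (num_tiles * power)
    t_comm = 200 * num_tiles ** 0.5
    return max(t_proc, t_comm)


candidates = list(product([16, 64, 256], [1, 2]))  # (P, p) pairs
best = optimize(candidates, toy_cost, toy_runtime, budget=2000)
```

With these placeholder models the 256-tile candidates
exceed the budget, and the search settles on 64 tiles
with a processing power of 2.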


3 Analysis 


In this section, we study a set of applications in
the context of the framework presented. The
applications are Jacobi Relaxation, Dense Matrix
Multiply, Nbody, FFT, and Longest Common Subsequence
(LCS). We chose these applications because they are
diverse and place conflicting demands on the machine
in order to run efficiently. The optimization procedure
has been implemented in Mathematica. We use a 3 cycle
software overhead and a 100 cycle DRAM access latency.
We also account for an 8Kbyte SRAM-based instruction
cache per node.

In all the experiments we used a budget of 1 bil- 
lion SRAM bit equivalents or the area required for 1 
billion logic transistors. This budget is achievable in 
10-12 years as projected by the Semiconductor Indus- 
try Association (SIA) given a 10-20% growth rate per 
year of die areas and a growth rate in transistor counts 
of between 60 and 80% per year due to increasing den- 
sities. 
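
To give a feel for these growth rates, the compound
growth factor over the projection period can be
computed as below. This is only an arithmetic
illustration; the function name is hypothetical, and no
baseline transistor count is assumed because the text
does not state one.

```python
def growth_factor(rate_per_year, years):
    """Compound growth factor after the given number of years."""
    return (1.0 + rate_per_year) ** years


low = growth_factor(0.60, 10)   # 60% per year for 10 years
high = growth_factor(0.80, 12)  # 80% per year for 12 years
```

Transistor counts therefore grow by a factor of roughly
110 at the low end and roughly 1160 at the high end of
the quoted ranges.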


Application specific results. Figure 7 shows the 
optimal division of chip resources for the various ap- 
plications as a function of problem size. The optimal 
amount of each resource is shown in greater detail in 
Figures 8 through 12. Table 4 summarizes the optimal 
configurations and chip sizes. 

Perhaps the most important result from Figure 7 
is that the amount of area devoted to processing and 
local communication is more or less constant at about 
75 percent for all the applications and problem sizes. 


The global communication bandwidth of 30 words 
per cycle is the maximum achievable given a packag- 
ing technology allowing 2000 pins. The only applica- 
tion that is I/O limited and requires this bandwidth 
is FFT. All the other applications have a negligible 
area allocated to global communication. The total 
chip area for global communication is relatively small 
even for FFT. Therefore, providing the maximum pos- 
sible global bandwidth is not a bad idea in a final con- 
figuration. 

As we can see, the relative communication area re- 
quired is small in applications such as Jacobi and LCS 
as they also show good spatial locality. These appli- 
cations can use most of the resources for processing. 
FFT and Nbody require the largest communication 
area with an optimal communication bandwidth be- 
tween 4 and 5 words per cycle. The division between 
processing and memory areas is uniform. 

The matrix multiplication based on Connor’s mem- 
ory efficient blocking algorithm gives the most uni- 
formly divided configuration. For this application, 
memory, local communication and processing areas 
are approximately equal. 

The amount of memory per node obtained is rela- 
tively small compared to modern day multiprocessors 
in all applications. The reason is twofold. First, the 
total amount of memory in the entire Raw chip is still 
quite large, since it is the product of P and m. Sec- 
ond, fast local communication obviates the need for 
huge amounts of local memory. The matrix multipli- 
cation required the largest amount of memory giving 
a total of 24 Kbytes per node. The smallest memory 
is required for Nbody. 

For all the applications the optimal processing
power obtained is equivalent to a single-issue
processor. The total number of processors P varied
between 1100 and 2310 for large problem sizes.

Although the optimal machine configurations
obtained vary for different applications, problem sizes
and budgets, the general trends for various
applications are similar. Accordingly, for the
applications studied, assuming a 1 billion logic
transistor equivalent area, we recommend building a Raw
chip with approximately 1000 tiles, 30 words/cycle
global I/O, 20Kbytes of local memory per node, 3-4
words/cycle local communication bandwidth, and
single-issue processors. This configuration will give
performance near the global optimum for most
applications.
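
As a quick check on the aggregate resources implied by
this recommendation, the configuration can be written
down as a small record. The class and field names are
hypothetical, and 3.5 words/cycle is simply the midpoint
of the recommended 3-4 range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawConfig:
    """Headline figures of a Raw chip configuration (illustrative)."""
    tiles: int                     # number of tiles P
    memory_kbytes: float           # local memory per tile, m
    local_words_per_cycle: float   # local bandwidth per tile, c
    global_words_per_cycle: float  # global I/O bandwidth of the chip

    def total_memory_kbytes(self) -> float:
        # Aggregate on-chip data memory is the product of P and m.
        return self.tiles * self.memory_kbytes


# Recommended configuration from the text (3.5 is an assumed midpoint).
recommended = RawConfig(tiles=1000, memory_kbytes=20,
                        local_words_per_cycle=3.5,
                        global_words_per_cycle=30)
total = recommended.total_memory_kbytes()
```

Although each tile holds only 20 Kbytes, the chip as a
whole carries 1000 x 20 = 20,000 Kbytes of local memory.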


Sensitivity of grain size. The framework helps answer
many other questions about machine configurations.
Let us study the sensitivity of performance to the
machine configuration near the optimum machine
configuration point. This study is useful to determine
a machine configuration that is robust across many
applications. As an example, let us determine the
machine configuration with the smallest number of nodes
whose performance is within 25 percent of the optimal
configuration.

Results are shown in Table 5. For each application,
the first row gives the optimal configuration. The
second row gives the configuration with the smallest
number of nodes under the condition that the
performance is within 25 percent of the optimal. As
we can see, balanced machine configurations with fewer
nodes usually take advantage of the parallelism
available in superscalar processors. However, for all
the applications studied, the configuration that gave
the best performance used nodes based on at most 2-way
superscalars.


Design comparisons. The framework also allows us
to compare competing designs for the same budget.
As an example, let us compare two designs: (1) using
on-chip SRAM and routers with 16-flit FIFOs, and (2)
using only a small SRAM cache, with the rest of the
memory in on-chip DRAM, as well as small 2-flit FIFOs.
We derive the performance/cost optimal configurations
and look at application performance for different
problem sizes.

Since DRAM densities are much higher than SRAM
densities, we can have more memory per node in
alternative (2). One problem in using DRAMs is that
the access latency is much higher than that of
corresponding SRAMs. To reduce the impact of the
latency, we include a small SRAM cache in each node and
assume that the SRAM cache results in a near perfect
hit rate. Case (2) also has small FIFOs: with good
static scheduling of the communication channels, the
need for deep FIFOs is reduced.

The question is how much these changes impact the
performance of applications, given performance/cost
optimal partitioning of resources in both cases.
Figure 13 shows the performance ratio between the
second and the first designs. It is easy to see that
the larger amount of on-chip memory in case (2)
results in significantly higher performance.


Figure 7: Breakdown of chip areas for processing,
memory, local communication and global communication
that give optimal machine configurations for a budget
of 1 billion logic transistor equivalent area.

Figure 8: Number of processors in optimal machine
configurations for different problem sizes.

Figure 9: Global I/O bandwidth $b_g$ in optimal machine
configurations for different problem sizes.

Figure 10: Local SRAM data memory m per node in
optimal machine configurations.

Figure 11: Local communication bandwidth c in optimal
machine configurations for different problem sizes.

Figure 12: Processing power p in optimal machine
configurations for different problem sizes.

Figure 13: Performance comparison between two cost
optimal designs each with a budget of 1 billion logic
transistor area. In the first design we use SRAM and
routers with 16-flit FIFOs, while in the second design
we use on-chip DRAM with a 1Kbyte SRAM cache
per node and 2-flit FIFOs.
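
The selection rule used in the grain-size sensitivity
study (the smallest number of nodes whose performance is
within 25 percent of the optimum) can be sketched as
follows. The sketch assumes that "within 25 percent"
means a runtime of at most 1.25 times the best runtime;
the function name and the sample data in the usage note
are hypothetical and are not taken from Table 5.

```python
def smallest_within_tolerance(configs, tolerance=0.25):
    """Pick the configuration with the fewest nodes near the optimum.

    configs is a list of (num_nodes, runtime) pairs. A configuration
    qualifies if its runtime is at most (1 + tolerance) times the
    best runtime in the list.
    """
    best_runtime = min(runtime for _, runtime in configs)
    limit = best_runtime * (1.0 + tolerance)
    eligible = [c for c in configs if c[1] <= limit]
    return min(eligible, key=lambda c: c[0])
```

Given the made-up candidates (2000, 100.0), (1000, 120.0),
(500, 126.0) and (250, 200.0), the limit is 125 cycles and
the rule selects the 1000-node configuration.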


4 Conclusions 


This paper provides a framework for reasoning 
about single chip microprocessors such as Raw with 
replicated, fine-grain processing elements. The frame- 
work uses a machine characterization that considers 
processing, memory, local and global communication, 
and latency as separate machine resources. This is a 
unique characterization of machine space since it cap- 
tures the effects of locality by treating local and global 
communication separately. The framework incorpo- 
rates a cost model based on empirical observations 
and statistics gathered on current implementations of 
superscalars and router chips. 

The framework recognizes the importance of balance
in good design, and integrates this idea with a
cost and performance model to provide a useful design
tool. Having provided this framework, this paper
chooses a diverse application suite in order to
exercise the framework and to address some general
questions in parallel computer design. More
specifically, it addresses the questions of on-chip
resource division in the MIT Raw microprocessor.


[ Problem [ Size] Po [Toc [ p [om [ b | [PRK TP 


FMatmul | 10" | 1200] 26 [125] 1640] 3.9 | | 98 | 35 | 28 | 0.02 | 0.01 | 
[Matmul | 10° [1290 | 2.3 [125 [230] 33 | | 35 | 30 | 33 | 0.02 | 0.01 | 
[ Matmul | 107 [724 | 8 [is | o7 | ss | | 2 | 6 | 9 | 005 | 001 | 
[Jacobi | 10" | 2180 | 010 [125] 404 | 44 | | 00 | 74 | 326 | 0.03 | 0.01 | 
| Jacobi_| 10° [aii_| 019 [125] 502 | 42 | | oo | 74 | 326 | 0.03 | 0.01 | 
[Jacobi_| 107 [i950 | 1 [a2] 25 | 21 | | 83 | 22 | 23 | 016 | 001 | 
[-Nbody | 10 [i100] 5 | 1 | 1 | 006 | | 26 | 00 | 13 | 0.0004 | 0.01 | 
[Nbody_| 10° [ioso [5 | 1 | o7 | 006 | | 26 | 60 | 13 | 0.0004 | 0.01 | 
[Nbody_| 107 [1070 | 5 | 1 | 8 | 05 | | 26 | 60 | 13 | 0.004 _| 0.01 | 
[FFT | 10° | i1o0 | 42 [125] 17s | 30 | | si | 50 | 18 | 02 | 001 | 
[FFT_| 10° [iseo | 42 [125 | ire | 30 | | si | 50 | 15 | 02 | 001 | 
[FT| 10" [ito | 42 [125 | ite | 30 | | 31 | 50 | 18 | 02 | 001 | 
Los | 10" [2310 | 001 | 125] 337 | oois| | «3 | 37 | 32 | 0.0001 | 0.01 | 
[Los | 10? [2330 [0.01 [1.25 | 291 [oo | | 63 | 37 | 32 | 0.0001 | 0.01 | 
[Los] 10" [2290 [0.25 | 15 | 20 | 025 | | 33 | 9 | 27 | 0.002 | 0.01 | 


Table 4: Breakdown of resources and optimal machine
configurations for three problem sizes. Columns P to
$b_g$ represent the optimal machine configuration, and
the columns from $PK_p$ to $K_{lg}$ are the chip sizes
in percent of the total cost.


[Pebea | Po as) ee ae) te ee) Phe | Pie ey pel 
[Tacobi_[ 2180 | 302230 [019 [125 | 46% | 44 | 60 | 74 | 326 | 001 | 001, 
[Nobody [1100 [ e306 | 5 | 1 | o1 | 006 | 2 | 60 | 13 | 0.0002 | 001, 


Table 5: Solutions that come within 25% of the optimal
for a problem size of 10° with the smallest number of
nodes P. The first row of each application shows the
global optimum, and the second row shows the solution
with the minimum number of processors and performance
within 25% of the optimal. The numbers in parentheses
show the performance degradation compared to the
global optimum for the configurations with the minimum
number of processors. The columns from P to $b_g$
represent the machine configuration, and the columns
from $PK_p$ to $K_{lg}$ are the chip sizes in percent
of the total cost.

Although the optimal machine configurations vary
for different applications, problem sizes and budgets,
the general trends are consistent. The framework
recommended that chip designers devote about 75
percent of the chip area to processing and local
communication. The framework further suggested that,
for the applications studied and assuming a 1 billion
logic transistor equivalent area, designers should
build a system with about 1000 nodes, 30 words/cycle
of global I/O, 20Kbytes of local memory per node,
3-4 words/cycle local communication bandwidth, and
single-issue processors for optimal performance.